Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2005/12/07 15:11:08 UTC

[jira] Created: (NUTCH-134) Summarizer doesn't select the best snippets

Summarizer doesn't select the best snippets
-------------------------------------------

         Key: NUTCH-134
         URL: http://issues.apache.org/jira/browse/NUTCH-134
     Project: Nutch
        Type: Bug
  Components: searcher  
    Versions: 0.7.1, 0.7, 0.7.2-dev, 0.8-dev    
    Reporter: Andrzej Bialecki 


Summarizer.java tries to select the best fragments from the input text, i.e. those where the frequency of query terms is highest. However, the logic in line 223 is flawed: the excerptSet.add() operation adds a new excerpt only if it is not already present, and that test is performed by a Comparator that compares only numUniqueTokens. This means that if two or more excerpts score equally high, only the first of them is retained, and the remaining equally scoring excerpts are discarded in favor of other (possibly lower-scoring) excerpts.

To fix this the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order the Excerpt class should be extended with an "int order" field, and the collected excerpts should be sorted in that order prior to adding them to the summary.
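
A minimal sketch of the problem and the proposed fix, not the actual Summarizer.java code. ExcerptSketch, ExcerptSelectionSketch and their methods are hypothetical stand-ins; only numUniqueTokens and the "int order" field come from the description above. It assumes the set is a sorted set driven by the score-only Comparator, which is what makes equal scores look like duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for the Excerpt class; only the fields needed here.
class ExcerptSketch {
    final String text;
    final int numUniqueTokens; // score: distinct query terms matched
    final int order;           // position of the excerpt in the original text

    ExcerptSketch(String text, int numUniqueTokens, int order) {
        this.text = text;
        this.numUniqueTokens = numUniqueTokens;
        this.order = order;
    }
}

class ExcerptSelectionSketch {
    static final Comparator<ExcerptSketch> BY_SCORE =
        Comparator.comparingInt(e -> e.numUniqueTokens);
    static final Comparator<ExcerptSketch> BY_ORDER =
        Comparator.comparingInt(e -> e.order);

    // Reported behavior: a sorted set using BY_SCORE treats excerpts with
    // equal scores as duplicates, so later ones are silently dropped.
    static int countKeptBySet(List<ExcerptSketch> excerpts) {
        SortedSet<ExcerptSketch> set = new TreeSet<>(BY_SCORE);
        set.addAll(excerpts);
        return set.size();
    }

    // Proposed fix: collect into a List, sort by score (best first), keep
    // the top `max`, then restore document order using the `order` field.
    static List<String> selectBest(List<ExcerptSketch> excerpts, int max) {
        List<ExcerptSketch> sorted = new ArrayList<>(excerpts);
        sorted.sort(BY_SCORE.reversed()); // stable sort: ties keep input order
        List<ExcerptSketch> best =
            new ArrayList<>(sorted.subList(0, Math.min(max, sorted.size())));
        best.sort(BY_ORDER);
        List<String> texts = new ArrayList<>();
        for (ExcerptSketch e : best) {
            texts.add(e.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<ExcerptSketch> in = Arrays.asList(
            new ExcerptSketch("a", 2, 0), new ExcerptSketch("b", 3, 1),
            new ExcerptSketch("c", 3, 2), new ExcerptSketch("d", 1, 3));
        System.out.println(countKeptBySet(in)); // "c" is lost: it ties with "b"
        System.out.println(selectBest(in, 2));  // both top-scoring excerpts survive
    }
}
```

With the set, asking for the two best excerpts would yield "b" and the lower-scoring "a"; with the list, both "b" and "c" are kept.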

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378170 ] 

Andrzej Bialecki  commented on NUTCH-134:
-----------------------------------------

I still prefer Summary as Writable. The reason is that there are users of Summary that don't want a single String with HTML formatting - recovering from this format is tedious and error-prone. On the other hand, returning a Writable may have performance implications.


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Chris Fellows (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377654 ] 

Chris Fellows commented on NUTCH-134:
-------------------------------------

byron,

Did you ever get a chance to run a cpu perf test on using lucene/contrib/highlighter for extracting summaries?

chris


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "byron miller (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12361350 ] 

byron miller commented on NUTCH-134:
------------------------------------

Where is the Lucene summarizer in contrib? I'm not seeing anything obvious (unless it's under a different name).


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "byron miller (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12363400 ] 

byron miller commented on NUTCH-134:
------------------------------------

Thanks Erik, I was able to pull down the highlighter and I'll be loading it up on mozdex.com to test out over the weekend (1/21/2006). I'll let people know if my CPU skyrockets :)


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377733 ] 

Jerome Charron commented on NUTCH-134:
--------------------------------------

Since we can imagine many Summarizer implementations, and each kind of Nutch deployment can have its own specific needs for a Summarizer, I suggest creating:

1. a Summarizer extension point, so that we can easily switch from one implementation to another.
2. a summarizer-basic plugin that is in fact the current summarizer implementation.
3. a summarizer-lucene plugin that wraps Lucene's summarizer.

If everybody is OK with this, I will implement it.
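
The three-part proposal above can be illustrated with a small, self-contained sketch. It does not use Nutch's plugin system or Lucene: every type and method name below is hypothetical, the "basic" implementation is a toy, and the wrapping implementation simply delegates to a function that stands in for Lucene's highlighter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical extension point: all summarizers share one interface.
interface SummarizerSketch {
    String summarize(String text, String query);
}

// Toy stand-in for "summarizer-basic": returns a fixed-size window that
// starts at the first (case-sensitive) occurrence of the query.
class BasicSummarizerSketch implements SummarizerSketch {
    private static final int WINDOW = 40;

    @Override
    public String summarize(String text, String query) {
        int start = Math.max(text.indexOf(query), 0); // no hit: start of text
        return text.substring(start, Math.min(text.length(), start + WINDOW));
    }
}

// Stand-in for "summarizer-lucene": wraps some other summarizing function.
// A real plugin would call Lucene's highlighter here; this sketch does not.
class WrappingSummarizerSketch implements SummarizerSketch {
    private final BiFunction<String, String, String> delegate;

    WrappingSummarizerSketch(BiFunction<String, String, String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public String summarize(String text, String query) {
        return delegate.apply(text, query);
    }
}

// Minimal registry playing the role of the extension point lookup.
class SummarizerRegistrySketch {
    private final Map<String, SummarizerSketch> plugins = new HashMap<>();

    void register(String id, SummarizerSketch summarizer) {
        plugins.put(id, summarizer);
    }

    SummarizerSketch get(String id) {
        SummarizerSketch s = plugins.get(id);
        if (s == null) {
            throw new IllegalArgumentException("No summarizer registered as " + id);
        }
        return s;
    }
}
```

Callers depend only on the interface and an identifier, so switching implementations becomes a configuration choice rather than a code change.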


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359629 ] 

Andrzej Bialecki  commented on NUTCH-134:
-----------------------------------------

I _think_ the Lucene summarizer requires more CPU than this one... but this has to be checked. I'll work on that.


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Chris Fellows (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377866 ] 

Chris Fellows commented on NUTCH-134:
-------------------------------------

Jerome,

Let me know if you could use a hand with the implementation. I'd like to get to know the Nutch and Lucene code bases better for my project. This looks like a good area to start in, so any opportunity to jump in would be great.

chris


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12361351 ] 

Erik Hatcher commented on NUTCH-134:
------------------------------------

Byron - It's under contrib/highlighter.   For Nutch, which uses Lucene's trunk version, you'll want to build the Highlighter from scratch against that version of Lucene.


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Steven Yelton (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378063 ] 

Steven Yelton commented on NUTCH-134:
-------------------------------------

Andrzej, my solution to this problem was to fix the comparator to actually compare the fragments if numFragments() was the same for both excerpts.  Sounds like there are grander plans afoot, but this got me past my problem of only seeing one summary fragment when I actually had 3 (they were seen as equal so only the last was on the set).

Steven
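
The comparator workaround described above might look roughly like the following. This is a sketch under assumptions, not Steven's actual change: ScoredFragment, TieBreakSketch and the `score` field are hypothetical, with `score` standing in for whatever count the real comparator uses. The idea is to fall back to comparing the fragment text when the scores are equal, so a sorted set no longer treats distinct excerpts as the same element.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical excerpt stand-in: a text fragment plus its score.
class ScoredFragment {
    final String text;
    final int score;

    ScoredFragment(String text, int score) {
        this.text = text;
        this.score = score;
    }
}

class TieBreakSketch {
    // Original behavior: equal scores compare as 0, i.e. "same element".
    static final Comparator<ScoredFragment> SCORE_ONLY =
        Comparator.comparingInt((ScoredFragment f) -> f.score);

    // Workaround: break score ties by comparing the fragment text.
    static final Comparator<ScoredFragment> SCORE_THEN_TEXT =
        Comparator.comparingInt((ScoredFragment f) -> f.score)
                  .thenComparing(f -> f.text);

    // Number of fragments that survive insertion into a sorted set.
    static int keptCount(Comparator<ScoredFragment> cmp, List<ScoredFragment> in) {
        TreeSet<ScoredFragment> set = new TreeSet<>(cmp);
        set.addAll(in);
        return set.size();
    }
}
```

Two fragments with identical score and identical text would still collapse into one, which is usually the desired behavior for a summary.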


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359626 ] 

Doug Cutting commented on NUTCH-134:
------------------------------------

Can we yet replace Nutch's summarizer with the summarizer in Lucene's contrib directory?  Are there features that Nutch requires that that does not yet implement?  It's a shame to maintain two summarizers.  When I first wrote Nutch's summarizer there was no Lucene contrib summarizer...


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378387 ] 

Dawid Weiss commented on NUTCH-134:
-----------------------------------

(back from holidays, so a bit delayed, but) I confirm Andrzej's suggestion -- a plain-text-only summary is ideal for clustering, for example. HTML is quite uncomfortable to work with.


[jira] Updated: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-134?page=all ]

Jerome Charron updated NUTCH-134:
---------------------------------

    Attachment: summarizer.060506.patch

Here is a patch that adds a summarizer extension point and two summarizer plugins: summarizer-basic (the current Nutch implementation) and summarizer-lucene (the Lucene highlighter implementation).
Please note that the Lucene plugin is a very crude implementation: the highlighter directly constructs a text representation of the summary, so we need to parse the text to build a Summary object!!! (improvements are welcome).

This is a first step toward resolving this issue.
If there are no objections, I will commit this patch in the next few days and then:
1. Fix the original issue reported by Andrzej in summarizer-basic.
2. Add a toString(Encoder, Formatter) method in Summarizer so that a Summary object can be encoded and formatted with many implementations (it is the same logic as the one in the Lucene Highlighter) - Andrzej, do you prefer this solution or a solution where Summary is Writable?

PS: Chris, sorry but the major part of this patch was already done when you added your comment.
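
To make the toString(Encoder, Formatter) idea in item 2 concrete, here is a minimal sketch. The interfaces and the SummarySketch class are hypothetical and only borrow the general shape of the idea (one object that escapes text, another that marks up highlighted terms); they are not the Nutch or Lucene types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: escapes raw text for the target output format.
interface EncoderSketch {
    String encode(String raw);
}

// Hypothetical: decorates an already-encoded highlighted term.
interface FormatterSketch {
    String highlight(String encoded);
}

// Hypothetical summary: an ordered list of plain and highlighted parts.
class SummarySketch {
    private static final class Part {
        final String text;
        final boolean highlighted;

        Part(String text, boolean highlighted) {
            this.text = text;
            this.highlighted = highlighted;
        }
    }

    private final List<Part> parts = new ArrayList<>();

    SummarySketch add(String text, boolean highlighted) {
        parts.add(new Part(text, highlighted));
        return this;
    }

    // The structured summary is rendered only at the last moment, so the
    // same object can produce HTML, plain text, or any other format.
    String toString(EncoderSketch encoder, FormatterSketch formatter) {
        StringBuilder sb = new StringBuilder();
        for (Part p : parts) {
            String encoded = encoder.encode(p.text);
            sb.append(p.highlighted ? formatter.highlight(encoded) : encoded);
        }
        return sb.toString();
    }
}
```

An HTML caller would pass an escaping encoder and a formatter that wraps terms in markup, while a clustering caller could pass two identity functions and get plain text back.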


[jira] Resolved: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-134?page=all ]
     
Jerome Charron resolved NUTCH-134:
----------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed
      Assign To: Jerome Charron

Solution proposed by Andrzej implemented.


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12377748 ] 

Andrzej Bialecki  commented on NUTCH-134:
-----------------------------------------

Please consider changing the API to return an array of writables as a result, instead of the current single String/UTF8. There are many applications (non-html output, clustering) that could greatly benefit from this change.


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] 

Doug Cutting commented on NUTCH-134:
------------------------------------

+1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String.  I don't think this has bad performance implications.
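
As a rough illustration of what "Summary as Writable" could mean, the sketch below serializes a list of fragments through java.io.DataOutput and reads it back. WritableSketch is a local stand-in that mirrors the two-method shape of the Writable interface (write and readFields); WritableSummarySketch and its fragment list are hypothetical and do not reflect the real Summary class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the Writable contract, so the sketch is self-contained.
interface WritableSketch {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical summary that travels as structured fragments, not as one
// pre-formatted HTML string.
class WritableSummarySketch implements WritableSketch {
    final List<String> fragments = new ArrayList<>();

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(fragments.size());
        for (String f : fragments) {
            out.writeUTF(f);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fragments.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            fragments.add(in.readUTF());
        }
    }

    // Serializes and deserializes in memory, as a remote call would do
    // over the wire.
    static WritableSummarySketch roundTrip(WritableSummarySketch source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            WritableSummarySketch copy = new WritableSummarySketch();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The receiver gets the fragments back intact and can decide for itself how to render them, which is the point made above about non-HTML output and clustering.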


[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

Posted by "byron miller (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359649 ] 

byron miller commented on NUTCH-134:
------------------------------------

I would take more CPU for better summaries any day :) CPU power is cheaper than manual intervention!

If any testing is needed, don't hesitate to drop me a patch. I've been working on a 500-million-page index using the mapred branch on a 10-node cluster, so I have plenty of numbers to test against.
