Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2014/03/16 05:46:57 UTC

[jira] [Updated] (LUCENE-5381) Lucene highlighter doesn't honor hl.fragsize; it appends all text for last fragment

     [ https://issues.apache.org/jira/browse/LUCENE-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-5381:
---------------------------------

    Fix Version/s:     (was: 4.7)
                   4.8

> Lucene highlighter doesn't honor hl.fragsize; it appends all text for last fragment
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-5381
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5381
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 4.0, 4.6
>            Reporter: yuanyun.cn
>            Priority: Minor
>              Labels: highlighter, lucene
>             Fix For: 4.8, 5.0
>
>         Attachments: LUCENE-5381.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Recently we hit a problem with the highlighter: I set hl.fragsize = 300, but the highlighted section for one document was more than 2000 characters long.
> Looking into the code, in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int), after the for loop, it appends all of the remaining text to the last fragment:
> if (
> 		// if there is text beyond the last token considered..
> 		(lastEndOffset < text.length())
> 		&&
> 		// and that text is not too large...
> 		(text.length()<= maxDocCharsToAnalyze)
> 	)
> {
> 	//append it to the last fragment
> 	newText.append(encoder.encodeText(text.substring(lastEndOffset)));
> }
> currentFrag.textEndPos = newText.length();
> This code is problematic because, in some cases, the last fragment is the most relevant section and is the one selected to be returned to the client, so the response can far exceed hl.fragsize.
> I changed the code as follows, and now it works:
> //Test what remains of the original text beyond the point where we stopped analyzing
> if(lastEndOffset < text.length())
> {
> 	if(textFragmenter instanceof SimpleFragmenter)
> 	{
> 		SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
> 		int remain = simpleFragmenter.getFragmentSize() - (newText.length() - currentFrag.textStartPos);
> 		if(remain > 0 )
> 		{
> 			int endIndex = lastEndOffset + remain;
> 			if (endIndex > text.length()) {
> 				endIndex = text.length();
> 			}
> 			newText.append(encoder.encodeText(text.substring(lastEndOffset,
> 					endIndex)));
> 		}
> 	}
> 	else
> 	{
> 		newText.append(encoder.encodeText(text.substring(lastEndOffset)));
> 	}
> }
> currentFrag.textEndPos = newText.length();
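
The clamping arithmetic in the proposed change can be tried in isolation. The following is a minimal sketch with no Lucene dependency, not the attached patch: the class TailClampSketch, the method clampedTail, and its parameter names are hypothetical. Here fragmentSize stands in for simpleFragmenter.getFragmentSize() and currentFragLength for newText.length() - currentFrag.textStartPos. Note that in the real highlighter newText holds encoded text plus highlight markup, so the remaining budget is only approximate.

```java
class TailClampSketch {

    /**
     * Hypothetical helper (not part of Lucene): returns the slice of the
     * original text that may still be appended to the last fragment
     * without exceeding the configured fragment size.
     *
     * @param text              the full original text
     * @param lastEndOffset     end offset of the last token that was analyzed
     * @param fragmentSize      target fragment size (assumed to mirror hl.fragsize)
     * @param currentFragLength characters already written to the last fragment
     */
    static String clampedTail(String text, int lastEndOffset,
                              int fragmentSize, int currentFragLength) {
        // Nothing left beyond the last analyzed token.
        if (lastEndOffset >= text.length()) {
            return "";
        }
        // Budget left in the current fragment.
        int remain = fragmentSize - currentFragLength;
        if (remain <= 0) {
            return "";
        }
        // Never read past the end of the text.
        int endIndex = Math.min(text.length(), lastEndOffset + remain);
        return text.substring(lastEndOffset, endIndex);
    }

    public static void main(String[] args) {
        String text = "abcdefghij";
        // Fragment size 5, 3 characters already used: 2 more are allowed.
        System.out.println(clampedTail(text, 4, 5, 3));
        // Large budget: the whole tail fits.
        System.out.println(clampedTail(text, 4, 100, 3));
        // Fragment already full: nothing is appended.
        System.out.println(clampedTail(text, 4, 3, 3).isEmpty());
    }
}
```

Unlike the original code path, this sketch never appends more than the remaining budget, which is the behavior the reporter expected from hl.fragsize = 300.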


